Current Issue: October-December 2025, Issue Number 4 (5 Articles)
Speech recognition in noisy environments has long posed a challenge. Air conduction microphones (ACMs), the devices typically used, are susceptible to environmental noise. In this work, a customized bone conduction microphone (BCM) system based on a piezoelectric micromachined ultrasonic transducer is developed to capture speech through real-time bone conduction (BC), while a commercial ACM is integrated for simultaneous capture of speech through air conduction (AC). The system enables simpler and more robust BC speech capture. The BC speech capture achieves a signal-to-noise amplitude ratio over five times greater than that of AC speech capture in an environment with a noise level of 68 dB. Instead of using only AC-captured speech, both BC- and AC-captured speech are input into a speech enhancement (SE) module. The noise-insensitive BC-captured speech serves as a reference to adapt the SE backbone operating on the AC-captured speech. The two types of speech are fused, and noise suppression is applied to generate enhanced speech. Compared with the original noisy speech, the enhanced speech achieves a character error rate reduction of over 20%, approaching the speech recognition accuracy of clean speech. The results indicate that this speech enhancement method based on the fusion of BC- and AC-captured speech efficiently integrates the features of both types of speech, thereby improving speech recognition accuracy in noisy environments. This work presents an innovative system designed to efficiently capture BC speech and enhance speech recognition in noisy environments.
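As a rough illustration of the fusion idea, the sketch below combines the noisy AC spectrogram with the noise-insensitive BC magnitude used as a soft-mask reference. The masking rule, the 0.1 floor, and all other constants are hypothetical stand-ins for the paper's learned SE backbone, not the authors' method.

```python
# Illustrative BC/AC fusion sketch (not the paper's learned SE backbone).
import numpy as np
from scipy.signal import stft, istft

def fuse_bc_ac(ac, bc, fs=16000, nperseg=512):
    """Fuse noisy air-conduction (ac) and noise-robust bone-conduction (bc)
    speech of equal length in the spectrogram domain."""
    _, _, AC = stft(ac, fs=fs, nperseg=nperseg)
    _, _, BC = stft(bc, fs=fs, nperseg=nperseg)
    # BC speech is largely insensitive to airborne noise, so its magnitude
    # can act as a spectral speech-presence reference for the AC signal.
    ref = np.abs(BC) / (np.abs(BC).max() + 1e-8)
    mask = ref / (ref + 0.1)  # soft mask; 0.1 is an arbitrary floor
    # Keep AC phase and high-frequency detail, attenuating time-frequency
    # bins where the BC reference indicates little speech energy.
    _, enhanced = istft(mask * AC, fs=fs, nperseg=nperseg)
    return enhanced
```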
With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is indispensable for saturation diving operations and is a critical technology for deep saturation diving, serving as the sole communication method that ensures the smooth execution of such operations. This study introduces deep learning into helium speech recognition and proposes a spectrogram-based dual-model helium speech recognition method. First, we extract spectrogram features from the helium speech. Then, we combine a deep fully convolutional neural network with connectionist temporal classification (CTC) to form an acoustic model, in which the spectrogram features of helium speech are used as input to convert speech signals into phonetic sequences. Finally, a maximum entropy Markov model (MEMM) is employed as the language model to convert the phonetic sequences into word outputs, which is treated as a dynamic programming problem: we use the Viterbi algorithm to find the optimal path that decodes the phonetic sequences into word sequences. The simulation results show that the method can effectively recognize helium speech, with a recognition rate of 97.89% for isolated words and 95.99% for continuous helium speech.
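The decoding step described above is classical dynamic programming. The following is a minimal log-domain Viterbi decoder; the toy initial, transition, and emission tables stand in for the MEMM language model, whose actual parameterization the abstract does not specify.

```python
# Generic log-domain Viterbi decoder (toy stand-in for the MEMM decoding).
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """obs: observed phone indices [T]. Returns the most probable hidden
    word-state path under the given log-probability tables."""
    T, n_states = len(obs), log_init.shape[0]
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # [prev_state, cur_state]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    path = [int(score[-1].argmax())]              # trace best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: two word states, three phone symbols.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], log_init, log_trans, log_emit))  # -> [0, 0, 1]
```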
In mobility service environments, recognizing the user's condition and driving status is critical to driving safety and experience. While speech emotion recognition is one possible feature for predicting driver status, current emotion recognition models have a fundamental limitation: they are designed to classify a single emotion class rather than multiple classes, which prevents a comprehensive understanding of the driver's condition and intention during driving. In addition, mobility devices inherently generate noise that can degrade speech emotion recognition performance in the mobility service. Considering mobility service environments, we investigate models that detect multiple emotions while mitigating noise issues. In this paper, we propose a speech emotion recognition model based on an autoencoder for multi-emotion detection. First, we analyze Mel Frequency Cepstral Coefficients (MFCCs) to design the specific features. We then develop a multi-emotion detection scheme based on an autoencoder that detects multiple emotions with substantially more flexibility than existing models. Using the proposed scheme, we investigate mobility noise impacts and mitigation approaches and evaluate the resulting performance.
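One plausible realization of such a scheme, sketched below, trains one small autoencoder per emotion on that emotion's MFCC vectors and, at test time, reports every emotion whose reconstruction error falls under its threshold, so several emotions can be detected at once. The architecture, mean pooling, and thresholds are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: per-emotion autoencoders enabling multi-emotion detection.
import librosa
import torch
import torch.nn as nn

N_MFCC = 13

def mfcc_vector(path):
    """Load audio and return a mean-pooled MFCC feature vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return torch.tensor(mfcc.mean(axis=1), dtype=torch.float32)

class EmotionAE(nn.Module):
    """Tiny autoencoder trained only on one emotion's feature vectors."""
    def __init__(self, dim=N_MFCC, hidden=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.dec(self.enc(x))

def detect_emotions(x, autoencoders, thresholds):
    """Report every emotion whose autoencoder reconstructs x well;
    thresholds would be tuned on held-out data."""
    detected = []
    with torch.no_grad():
        for emotion, ae in autoencoders.items():
            err = torch.mean((ae(x) - x) ** 2).item()
            if err < thresholds[emotion]:
                detected.append(emotion)
    return detected
```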
Background/Objectives: Previous research has shown that listeners may use acoustic cues for speech processing that are perceived during brief segments in the noise when there is an optimal signal-to-noise ratio (SNR). This “glimpsing” effect requires higher cognitive skills than the speech tasks used in typical audiometric evaluations. Purpose: The aim of this study was to investigate the use of an online test of speech processing in noise in listeners with typical hearing sensitivity (TH, defined as thresholds ≤ 25 dB HL) who were asked to determine the gender of the subject in sentences presented at increasing levels of continuous and interrupted noise. Methods: This was a repeated-measures design with three factors (SNR, noise type, and syntactic complexity). Study Sample: Participants with self-reported TH (N = 153, ages 18–39 years, mean age = 20.7 years) who passed an online hearing screening were invited to complete an online questionnaire. Data Collection and Analysis: Participants completed a sentence recognition task under four SNRs (−6, −9, −12, and −15 dB), two syntactic complexity settings (subject-relative and object-relative center-embedded), and two noise types (interrupted and continuous). They listened to 64 sentences through their own headphones/earphones, presented in an online format at a user-selected comfortable listening level. Their task was to identify the gender of the person performing the action in each sentence. Results: Significant main effects of all three factors, as well as the SNR by noise-type two-way interaction, were identified (p < 0.05). This interaction indicated that the effect of SNR on sentence comprehension was more pronounced in the continuous noise condition than in the interrupted noise condition. Conclusions: Listeners with self-reported TH benefited from the glimpsing effect in the interrupted noise even at low SNRs (i.e., −15 dB). The evaluation of glimpsing may be a sensitive measure of auditory processing beyond the traditional word recognition used in clinical evaluations of persons who report hearing challenges, and it may hold promise for the development of auditory training programs.
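For readers unfamiliar with how such stimuli are constructed, the sketch below scales a noise signal to a target SNR and, for the interrupted condition, gates it on and off so that brief "glimpses" of cleaner speech remain. The 4 Hz gating rate is an illustrative assumption; the study's actual interruption parameters are not given in the abstract.

```python
# Sketch: mix speech with noise at a target SNR, optionally interrupted.
import numpy as np

def mix_at_snr(speech, noise, snr_db, fs=16000, interrupted=False, rate_hz=4):
    noise = noise[: len(speech)]
    # Scale noise power so that 10*log10(P_speech / P_noise) = snr_db.
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    noise = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    if interrupted:
        # Square-wave gate: noise is present half the time at rate_hz,
        # leaving periodic gaps ("glimpses") of relatively clean speech.
        t = np.arange(len(speech)) / fs
        noise = noise * (np.sin(2 * np.pi * rate_hz * t) > 0)
    return speech + noise
```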
The evolution of Natural Language Processing represents a journey from basic statistical methods to advanced artificial intelligence systems. Starting with foundational approaches like Bag of Words and TF-IDF, the field progressed through neural architectures including RNNs and Transformers, culminating in today's large language models. Each advancement has elevated capabilities in language understanding, translation, and generation. The transformation continues through multimodal integration, efficiency enhancements, reasoning improvements, and trustworthy AI development, while addressing fundamental technical challenges that will shape the future landscape of artificial intelligence.